Clustering analysis is one of the most widely used statistical tools in many emerging areas such as microarray data analysis. For microarray and other high-dimensional data, the presence of many noise variables may mask underlying clustering structures. Hence removing noise variables via variable selection is necessary. For simultaneous variable selection and parameter estimation, existing penalized likelihood approaches in model-based clustering analysis all assume a common diagonal covariance matrix across clusters, which, however, may not hold in practice. To analyze high-dimensional data, particularly those with relatively low sample sizes, this article introduces a novel approach that shrinks the variances together with means, in a more general situation with cluster-specific (diagonal) covariance matrices. Furthermore, selection of grouped variables via inclusion or exclusion of a group of variables altogether is permitted by a specific form of penalty, which facilitates incorporating subject-matter knowledge, such as gene functions in clustering microarray samples for disease subtype discovery. For implementation, EM algorithms are derived for parameter estimation, in which the M-steps clearly demonstrate the effects of shrinkage and thresholding. Numerical examples, including an application to acute leukemia subtype discovery with microarray gene expression data, are provided to demonstrate the utility and advantage of the proposed method.